line search
6d0bf1265ea9635fb4f9d56f16d7efb2-Supplemental-Conference.pdf
Supplementary Materials for "Don't be so Monotone: Relaxing Stochastic Line Search in Over-Parameterized Models" Appendix A The Algorithm Appendix B Convergence Rates Appendix B.1 Rate of Convergence for Strongly Convex Functions Appendix B.2 Rate of Convergence for Convex Functions Appendix B.3 Rate of Convergence for Functions Satisfying the PL Condition Appendix B.4 Common Lemmas Appendix B.5 The Polyak Step Size is Bounded Appendix C Experimental details Appendix D Plots Completing the Figures in the Main Paper Appendix D.1 Comparison between PoNoS and the state-of-the-art Appendix D.2 A New Resetting Technique Appendix D.3 Time Comparison Appendix D.4 Experiments on Convex Losses Appendix D.5 Experiments on Transformers Appendix E Additional Plots Appendix E.1 Study on the Choice of c: Theory (0.5) vs Practice (0.1) Appendix E.2 Study on the Line Search Choice: V arious Nonmonotone Adaptations Appendix E.3 Zoom in on the Amount of Backtracks Appendix E.4 Study on the Choice of η In this section, we give the details of our proposed algorithm PoNoS. Training machine learning models (e.g., neural networks) entails solving the following finite sum problem: min Before that, we establish the following auxiliary result. The following Lemma shows the importance of the interpolation property. Lemma 4. W e assume interpolation and that f Let us now analyze case 2). Let us now show that b < 1. B.2 Rate of Convergence for Convex Functions In this subsection, we prove a O ( The above bound will be now proven also for case 2).
6d0bf1265ea9635fb4f9d56f16d7efb2-Paper-Conference.pdf
Recent works have shown that line search methods can speed up Stochastic Gradient Descent (SGD) and Adam in modern over-parameterized settings. However, existing line searches may take steps that are smaller than necessary since they require a monotone decrease of the (mini-)batch objective function.
- Asia > Japan > Honshū > Tōhoku > Fukushima Prefecture > Fukushima (0.04)
- North America > Canada > British Columbia (0.04)
- Europe > Russia (0.04)
- (2 more...)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)
Don't be so Monotone: Relaxing Stochastic Line Search in Over-Parameterized Models
Recent works have shown that line search methods can speed up Stochastic Gradient Descent (SGD) and Adam in modern over-parameterized settings. However, existing line searches may take steps that are smaller than necessary since they require a monotone decrease of the (mini-)batch objective function. We explore nonmonotone line search methods to relax this condition and possibly accept larger step sizes. Despite the lack of a monotonic decrease, we prove the same fast rates of convergence as in the monotone case. Our experiments show that nonmonotone methods improve the speed of convergence and generalization properties of SGD/Adam even beyond the previous monotone line searches. We propose a POlyak NOnmonotone Stochastic (PoNoS) method, obtained by combining a nonmonotone line search with a Polyak initial step size. Furthermore, we develop a new resetting technique that in the majority of the iterations reduces the amount of backtracks to zero while still maintaining a large initial step size. To the best of our knowledge, a first runtime comparison shows that the epoch-wise advantage of line-search-based methods gets reflected in the overall computational time.
The High Line: Exact Risk and Learning Rate Curves of Stochastic Adaptive Learning Rate Algorithms
We develop a framework for analyzing the training and learning rate dynamics on a large class of high-dimensional optimization problems, which we call the high line, trained using one-pass stochastic gradient descent (SGD) with adaptive learning rates. We give exact expressions for the risk and learning rate curves in terms of a deterministic solution to a system of ODEs. We then investigate in detail two adaptive learning rates -- an idealized exact line search and AdaGrad-Norm -- on the least squares problem. When the data covariance matrix has strictly positive eigenvalues, this idealized exact line search strategy can exhibit arbitrarily slower convergence when compared to the optimal fixed learning rate with SGD. Moreover we exactly characterize the limiting learning rate (as time goes to infinity) for line search in the setting where the data covariance has only two distinct eigenvalues. For noiseless targets, we further demonstrate that the AdaGrad-Norm learning rate converges to a deterministic constant inversely proportional to the average eigenvalue of the data covariance matrix, and identify a phase transition when the covariance density of eigenvalues follows a power law distribution.
Comparing BFGS and OGR for Second-Order Optimization
Przybysz, Adrian, Kołek, Mikołaj, Sobota, Franciszek, Duda, Jarek
Across standard test functions and ablations with/without line search, OGR variants match or outperform BFGS in final objective and step efficiency, with particular gains in nonconvex landscapes where saddle handling matters. Exact Hessians (via AD) are used only as an oracle baseline to evaluate estimation quality, not to form steps. II. Online Gradient Regression (OGR) Online Gradient Regression (OGR) is a second-order optimization framework that accelerates stochastic gradient descent (SGD) by online least-squares regression of noisy gradients to infer local curvature and the distance to a stationary point [3]. The central assumption is that, in a small neighborhood, the objective F (θ) is well-approximated by a quadratic model, so the gradient varies approximately linearly with the parameters. OGR maintains exponentially weighted statistics of recent (θ t, g t) pairs and updates a local model each iteration at negligible extra cost compared to computing the gradient itself [2], [3]. A. Direct multivariate approach In given time T, based on recent gradients g t R d and positions θ t R d for t < T, we would like to locally approximate behavior with 2nd order polynomial using parametrization: f (θ) = h + 1 2 (θ p) T H(θ p) f = H(θ p) for Hessian H R d d and p R d position of saddle or extremum. For local behavior we will work on averages with weights w t further decreasing exponentially, defining averages: v null t
- Europe > Poland > Masovia Province > Warsaw (0.04)
- Europe > Poland > Lesser Poland Province > Kraków (0.04)
- Europe > Poland > Greater Poland Province > Poznań (0.04)
- North America > United States > Illinois > Cook County > Chicago (0.04)
- North America > United States > California > Los Angeles County > Long Beach (0.04)
- Europe > Russia (0.04)
- Asia > Russia (0.04)
- Asia > Middle East > Jordan (0.04)
- North America > United States > New Jersey > Mercer County > Princeton (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- (2 more...)
- North America > United States > Illinois > Cook County > Lemont (0.04)
- Asia > Japan > Honshū > Tōhoku > Fukushima Prefecture > Fukushima (0.04)
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.93)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- North America > Canada (0.04)
high_prob_ls_nonconvex_final
We consider a line-search method for continuous optimization under a stochastic setting where the function values and gradients are available only through inexact probabilistic zeroth and first-order oracles. These oracles capture multiple standard settings including expected loss minimization and zeroth-order optimization. Moreover, our framework is very general and allows the function and gradient estimates to be biased. The proposed algorithm is simple to describe, easy to implement, and uses these oracles in a similar way as the standard deterministic line search uses exact function and gradient values. Under fairly general conditions on the oracles, we derive a high probability tail bound on the iteration complexity of the algorithm when applied to non-convex smooth functions. These results are stronger than those for other existing stochastic line search methods and apply in more general settings.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > Spain > Andalusia > Granada Province > Granada (0.04)